Appealing to several multivariate information measures (some familiar, some new here), we analyze the information embedded in discrete-valued stochastic time series. We dissect the uncertainty of a single observation to demonstrate how the measures' asymptotic behavior sheds structural and semantic light on the generating process's internal information dynamics. The measures scale with the length of the time window, which captures both intensive (rates of growth) and subextensive components. We provide interpretations for the components, developing explicit relationships between them. We also identify the informational component shared between the past and the future that [...] notion of a process's effective (internal) states and indicates why one must build models.
A single measurement, when considered in the context of the past and the future, contains a wealth of information, including distinct kinds of information. Can the present measurement be predicted from the past? From the future? Or, only from them together? Or not at all? Is some of the measurement due to randomness? Does that randomness have consequences for the future or is it simply lost? We answer all of these questions and more, giving a complete dissection of a measured bit of information.

I. INTRODUCTION 

In a time series of observations, what can we learn from just a single observation? If the series is a sequence of coin flips, a single observation tells us nothing of the past nor of the future. It gives a single bit of information about the present: one bit out of the infinite amount the time series contains. However, if the time series is periodic (say, alternating 0s and 1s), then with a single measurement in hand, the entire observation series need not be stored; it can be substantially compressed. In fact, a single observation tells us the oscillation's phase. And, with this single bit of information, we have learned everything: the full bit that the time series contains. Most systems fall somewhere between these two extremes. Here, we develop an analysis of the information contained in a single measurement that applies across this spectrum.

Starting from the most basic considerations, we decon- 
struct what a measurement is, using this to directly step 
through and preview the main results. With that fram- 
ing laid out, we reset, introducing and reviewing the rele- 
vant tools available from multivariate information theory 
including several that have been recently proposed. At 
that point, we give a synthesis employing information 
measures and the graphical equivalent of the informa- 
tion diagram. The result is a systematic delineation of 
the kinds of information that the distribution of single 
measurements can contain and their required contexts 
of interpretation. We conclude by indicating what is 
missing in previous answers to the measurement question 
above, identifying what they do and do not contribute, 
and why alternative state-centric analyses are ultimately 
more comprehensive. 

II. A MEASUREMENT: A SYNOPSIS

For our purposes an instrument is simply an interface between an observer and the system to which it attends. All the observer sees is the instrument's output; here, we take this to be one of $k$ discrete values. And, from a series of these outputs, the observer's goal is to infer and to understand as much about the system as possible: how predictable it is, what the active degrees of freedom are, what resources are implicated in generating its behavior, and the like.

The first step in reaching the goal is that the observer must store at least one measurement. How many decimal digits must its storage device have? To specify which one of $k$ instrument outputs occurred, the device must use $\log_{10} k$ decimal digits. If the device stores binary values, then it must provide $\log_2 k$ bits of storage. This is the maximum for a one-time measurement. If we perform a series of $n$ measurements, then the observer's storage device must have a capacity of $n \log_2 k$ bits.

Imagine, however, that over this series of measurements it happens that output 1 occurs $n_1$ times, 2 occurs $n_2$ times, and so on, with $k$ occurring $n_k$ times. It turns out that the storage device can have much less capacity, using less, sometimes substantially less, than $n \log_2 k$ bits.

To see this, recall that the number $M$ of possible sequences of $n$ measurements with $n_1, n_2, \ldots, n_k$ counts is given by the multinomial coefficient:

$$M = \binom{n}{n_1\, n_2 \cdots n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} .$$

So, to specify which sequence occurred we need no more than:

$$k \log_2 n + \log_2 M + \log_2 n + \cdots$$

The first term is the maximum number of bits to store the count $n_i$ of each of the $k$ output values. The second term is the number of bits needed to specify the particular observed sequence within the class of sequences that have counts $n_1, n_2, \ldots, n_k$. The third term is the number $b$ of bits to specify the number of bits in $n$ itself. Finally, the ellipsis indicates that we have to specify the number of bits to specify $b$ ($\log_2 \log_2 n$) and so on, until there is less than one bit.

We can make sense of this, and so develop a helpful comparison to the original storage estimate of $n \log_2 k$ bits, if we apply Stirling's approximation: $n! \approx \sqrt{2 \pi n}\, (n/e)^n$. For a sufficiently long measurement series, a little algebra gives:

$$\log_2 M \approx -n \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n} = n H[n_1/n, n_2/n, \ldots, n_k/n]$$

bits for $n$ observations. Here, the function $H[P]$ is Shannon's entropy of the distribution $P = (n_1/n, n_2/n, \ldots, n_k/n)$. As a shorthand, when discussing the information in a random variable $X$ that is distributed according to $P$, we also write $H[X]$. Thus, to the extent that $H[X] < \log_2 k$, as the series length $n$ grows the observer can effectively compress the original series of observations and so use less storage than $n \log_2 k$.
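The three storage figures discussed above can be compared numerically. The following is a minimal sketch, assuming a made-up series with $k = 4$ outputs; the series, the helper names `log2_multinomial` and `empirical_entropy`, and the chosen counts are illustrative and not taken from the text.

```python
from collections import Counter
from math import comb, log2

def log2_multinomial(counts):
    """Exact log2 of n! / (n_1! ... n_k!), built as a product of binomials."""
    total, bits = 0, 0.0
    for c in counts:
        total += c
        bits += log2(comb(total, c))
    return bits

def empirical_entropy(counts):
    """Shannon entropy (bits) of the distribution (n_1/n, ..., n_k/n)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Hypothetical measurement series: k = 4 outputs, strongly biased counts.
series = [1] * 700 + [2] * 200 + [3] * 90 + [4] * 10
counts = list(Counter(series).values())
n, k = len(series), 4

raw_bits = n * log2(k)                       # n log2 k
class_bits = log2_multinomial(counts)        # log2 M
entropy_bits = n * empirical_entropy(counts) # n H[n_1/n, ..., n_k/n]

print(f"raw: {raw_bits:.1f}  log2 M: {class_bits:.1f}  nH: {entropy_bits:.1f}")
```

For this biased series, $\log_2 M$ and $n H$ agree up to a correction that grows only logarithmically in $n$, and both sit well below $n \log_2 k$.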

The relationship between the raw measurement ($\log_2 k$) and the average-case view ($H[X]$), which we just laid out explicitly, is illustrated in the contrast between Figs. 1(a) and 1(b). The difference $R_1 = \log_2 k - H[X]$ is the amount of redundant information in the raw measurements. As such, the magnitude of $R_1$ indicates how much they can be compressed.

Information storage can be reduced further, since using $H[X]$ as the amount of information in a measurement implicitly assumed the instrument's outputs were statistically independent. And this, as it turns out, leads to $H[X]$ being an overestimate of the amount of information in $X$. For general information sources, there are correlations and restrictions between successive measurements that violate this independence assumption and, helpfully, we can use these to further compress sequences of measurements $X_1, X_2, \ldots, X_\ell$. Concretely, information theory tells us that the irreducible information per observation is given by the Shannon entropy rate:

$$h_\mu = \lim_{\ell \to \infty} \frac{H(\ell)}{\ell} , \tag{1}$$

where $H(\ell) = -\sum_{\{x^\ell\}} \Pr(x^\ell) \log_2 \Pr(x^\ell)$ is the block entropy: the Shannon entropy of the length-$\ell$ word distribution $\Pr(x^\ell)$.

The improved view of the information in a measurement is given in Fig. 1(c). Specifically, since $h_\mu \leq H[X]$, we can compress even more; indeed, by an amount $R_\infty = \log_2 k - h_\mu$.
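To make the gap between $H[X]$ and $h_\mu$ concrete, the block entropy can be estimated from a sample by counting length-$\ell$ words. The sketch below is only illustrative: it assumes a finite sample of the period-2 series from the Introduction, and the function name `block_entropy` and the finite-$\ell$ estimate $H(\ell)/\ell$ are choices made here, not prescriptions from the text.

```python
from collections import Counter
from math import log2

def block_entropy(seq, length):
    """Shannon entropy (bits) of the empirical length-`length` word distribution."""
    if length == 0:
        return 0.0
    words = Counter(tuple(seq[i:i + length]) for i in range(len(seq) - length + 1))
    total = sum(words.values())
    return -sum(c / total * log2(c / total) for c in words.values())

# Sample of the alternating (period-2) series; binary alphabet, so k = 2.
periodic = [0, 1] * 500
k = 2

H1 = block_entropy(periodic, 1)
# Finite-length stand-ins for the limit in Eq. (1).
rate_estimates = [block_entropy(periodic, L) / L for L in range(1, 9)]

R_1 = log2(k) - H1                       # redundancy seen by single-symbol statistics
R_inf_estimate = log2(k) - rate_estimates[-1]

print(f"H[X] = {H1:.3f} bits, H(8)/8 = {rate_estimates[-1]:.3f} bits")
```

Single-symbol statistics see no redundancy here ($R_1 \approx 0$), while the per-symbol block entropy keeps falling toward zero as the block length grows.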

These comments are no more than a review of basic information theory [1] that used a little algebra. They do, however, set the stage for a parallel, but more detailed, analysis of the information in an observation. In focusing on a single measurement, the following complements recent, more sophisticated analyses of information sources that focused on a process's hidden states [2, and references therein]. In the sense that the latter is a state-centric informational analysis of a process, the following takes the complementary measurement-centric view.

Partly as preview and partly to orient ourselves on the path to be followed, we illustrate the main results in a pictorial fashion similar to that just given; see Fig. 2, which further dissects the information in $X$.

As a first cut, the information $H[X]$ provided by each observation (Fig. 2(a)) can be broken into two pieces: one part is information $\rho_\mu$ that could be anticipated from prior observations and the other, $h_\mu$ (the random component), is that which could not be anticipated. (See Fig. 2(b).) Each of these pieces can be further decomposed into two parts. The random component $h_\mu$ breaks into two kinds of randomness: a part $b_\mu$ relevant for predicting the future, while the remaining part $r_\mu$ is ephemeral, existing only for the moment.

The redundant portion $\rho_\mu$ of $H[X]$ in turn splits into two pieces. The first part (also $b_\mu$ when the process is stationary) is shared between the past and the current observation, but its relevance stops there. The second piece $q_\mu$ is anticipated by the past, is present currently, and also plays a role in future behavior. Notably, this informational piece can be negative. (See Fig. 2(c).)

We can further combine all elements of $H[X]$ that participate in structure (whether it be past, future, or both) into a single element $w_\mu$. This decomposition of $H[X]$ provides a very different decomposition than $h_\mu$ and $\rho_\mu$. It partitions $H[X]$ into a piece $w_\mu$ that is structural and a piece $r_\mu$ that, as mentioned above, is ephemeral. (See Fig. 2(d).)

With the basic informational components contained in a single measurement laid out, we now derive them from first principles. The next step is to address information in collections of random variables, helpful in a broad array of problems. We then specialize to time series; viz., one-dimensional chains of random variables.

FIG. 1. Dissecting information in a single measurement $X$ being one of $k$ values.

FIG. 2. Systematic dissection of $H[X]$.



III. INFORMATION MEASURES 

Shannon's information theory [1] is a widely used mathematical framework with many advantages in the study of complex, nonlinear systems. Most importantly, it provides a unified quantitative way to analyze systems with broadly dissimilar physical substrates. It further makes no assumptions as to the types of correlation between variables, picking up multi-way nonlinear interactions just as easily as simple pairwise linear correlations.

The workhorse of information theory is the Shannon entropy of a random variable, just introduced. The entropy measures what would commonly be considered the amount of information learned, on average, from observing a sample from that random variable. The entropy $H[X]$ of a random variable $X$ taking on values $x \in \mathcal{A} = \{1, \ldots, k\}$ with distribution $\Pr(X = x)$ has the following functional form:

$$H[X] = -\sum_{x \in \mathcal{A}} \Pr(x) \log_2 \Pr(x) . \tag{2}$$

The entropy is defined in the same manner over joint random variables (say, $X$ and $Y$) where the above distribution is replaced by the joint probability $\Pr(X, Y)$.

When considering more than a single random variable, it is quite reasonable to ask how much uncertainty remains in one variable given knowledge of the other. The average entropy in one variable $X$ given the outcome of another variable $Y$ is the conditional entropy:

$$H[X|Y] = H[X, Y] - H[Y] . \tag{3}$$

That is, it is the entropy of the joint random variable $(X, Y)$ with the marginal entropy $H[Y]$ of $Y$ subtracted from it.

The fundamental measure of correlation between random variables is the mutual information. As stated before, it can be adapted to measure all kinds of interaction between two variables. It can be written in several forms, including:

$$I[X;Y] = H[X] + H[Y] - H[X,Y] \tag{4}$$
$$\phantom{I[X;Y]} = H[X,Y] - H[X|Y] - H[Y|X] . \tag{5}$$

Two variables are generally considered independent if their mutual information is zero.

Like the entropy, the mutual information can also be conditioned on another variable, say $Z$, resulting in the conditional mutual information. Its definition is a straightforward modification of Eq. (4):

$$I[X;Y|Z] = H[X|Z] + H[Y|Z] - H[X,Y|Z] . \tag{6}$$

For example, consider two random variables $X$ and $Y$ that take the values 0 or 1 independently and uniformly, and a third $Z = X \text{ XOR } Y$, the exclusive-or of the two. There is a total of two bits of information among the three variables: $H[X,Y,Z] = 2$ bits. Furthermore, the variables $X$ and $Y$ share a single bit of information with $Z$, their parity. Thus, $I[X,Y;Z] = 1$ bit. Interestingly, although $X$ and $Y$ are independent, $I[X;Y] = 0$, they are not conditionally independent: $I[X;Y|Z] = 1$ bit.
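The XOR example can be checked directly from Eqs. (2) to (6). This is a small sketch; the dictionary representation of the joint distribution and the helper `H` that marginalizes over variable positions are conveniences introduced here.

```python
from itertools import product
from math import log2

# Joint distribution of (X, Y, Z): X and Y fair and independent, Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(indices):
    """Entropy (bits) of the marginal over the given variable positions, Eq. (2)."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

X, Y, Z = 0, 1, 2
H_xyz = H([X, Y, Z])                               # 2 bits in total
I_xy_z = H([X, Y]) + H([Z]) - H([X, Y, Z])         # I[X,Y;Z] via Eq. (4): 1 bit
I_x_y = H([X]) + H([Y]) - H([X, Y])                # I[X;Y] via Eq. (4): 0 bits
# Eq. (6), with each conditional entropy expanded using Eq. (3).
I_x_y_given_z = H([X, Z]) + H([Y, Z]) - H([X, Y, Z]) - H([Z])   # 1 bit

print(H_xyz, I_xy_z, I_x_y, I_x_y_given_z)
```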



IV. MULTIVARIATE INFORMATION 
MEASURES 

We now turn to a difficult problem: How does one 
quantify interactions among an arbitrary set of variables? 
As just noted, the mutual information provides a very 
general, widely applicable method of measuring depen- 
dence between two, possibly composite, random vari- 
ables. The challenge comes in the fact that there exist 
several distinct methods for measuring dependence be- 
tween more than two random variables. 

Consider a finite set $\mathcal{A}$ and random variables $X_i$ taking on values $x_i \in \mathcal{A}$ for all $i \in \mathbb{Z}$. The vector of $N$ random variables $X_{0:N} = \{X_0, X_1, \ldots, X_{N-1}\}$ takes on values in $\mathcal{A}^N$. A straightforward generalization of Eq. (2) yields the joint entropy:

$$H[X_{0:N}] = -\sum_{\{x_{0:N}\}} \Pr(x_{0:N}) \log_2 \Pr(x_{0:N}) , \tag{7}$$

which measures the total amount of information contained in the joint distribution. From here onward, we suppress notating the set $\{x_{0:N}\}$ of realizations over which the sums are taken.

In generalizing the mutual information to arbitrary sets of variables, we make use of power sets. We let $\Omega_N = \{0, 1, \ldots, N-1\}$ denote the universal set over the variable indices and define $P(N) = \mathcal{P}(\Omega_N)$ as the power set over $\Omega_N$. Then, for any set $A \in P(N)$, its complement is denoted $\bar{A} = \Omega_N \setminus A$. Finally, we use a shorthand to refer to the set of random variables corresponding to index set $A$:

$$X_A = \{X_i : i \in A\} . \tag{8}$$

There are at least three extensions of the two-variable mutual information, each based on a different interpretation of what its original definition intended. The first is the multivariate mutual information or co-information [3]: $I[X_0; X_1; \ldots; X_{N-1}]$. Denoted $I[X_{0:N}]$, it is the amount of mutual information to which all variables contribute:

$$I[X_{0:N}] = \sum \Pr(x_{0:N}) \log_2 \left( \prod_{A \in P(N)} \Pr(x_A)^{(-1)^{|A|}} \right)$$
$$\phantom{I[X_{0:N}]} = -\sum_{A \in P(N)} (-1)^{|A|} H[X_A] \tag{9}$$
$$\phantom{I[X_{0:N}]} = H[X_{0:N}] - \sum_{\substack{A \in P(N) \\ 0 < |A| < N}} I[X_A | X_{\bar{A}}] , \tag{10}$$

where, e.g., $I[X_{\{1,3,4\}} | X_{\{0,2\}}] = I[X_1; X_3; X_4 | X_0, X_2]$. It can be verified that Eq. (9) is a generalization of Eq. (4), adding and subtracting all possible entropies according to the number of random variables they include. The co-information has several interesting properties. First, it can be negative, though a consistent interpretation of what this means is still lacking in the literature. Second, this measure vanishes if any two variables in the set are completely independent. (That is, they are independent and also conditionally independent with respect to all subsets of the other variables.) This is true regardless of interdependencies among the other variables.
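The alternating-sign sum of Eq. (9) is straightforward to evaluate for small joint distributions. The sketch below assumes distributions stored as dictionaries from outcome tuples to probabilities; the three example distributions and the function names are illustrative choices made here.

```python
from itertools import combinations, product
from math import log2

def entropy(joint, indices):
    """Entropy (bits) of the marginal of `joint` over the given positions."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def co_information(joint, n):
    """Eq. (9): minus the sum over index subsets A of (-1)^|A| H[X_A]."""
    total = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total -= (-1) ** size * entropy(joint, subset)
    return total

# Illustrative distributions (assumptions of this sketch).
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
copies = {(b, b, b): 0.5 for b in (0, 1)}             # three copies of one fair bit
indep_pair = {(x, y): 0.25 for x, y in product((0, 1), repeat=2)}

print(co_information(xor, 3), co_information(copies, 3), co_information(indep_pair, 2))
```

The XOR triple gives a negative value, three copies of one bit give a positive value, and the independent pair gives zero, in line with the properties listed above.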

In the second interpretation, the mutual information is seen as the relative entropy between a joint distribution and the product of its marginals. Specifically, the starting point is:

$$I[X;Y] = \sum_{x,y} \Pr(x,y) \log_2 \frac{\Pr(x,y)}{\Pr(x) \Pr(y)} , \tag{11}$$

which is simply a rewriting of Eq. (4). When generalized from this form, we obtain the total correlation [4]:

$$T[X_{0:N}] = \sum \Pr(x_{0:N}) \log_2 \left( \frac{\Pr(x_{0:N})}{\Pr(x_0) \cdots \Pr(x_{N-1})} \right)$$
$$\phantom{T[X_{0:N}]} = \sum_{\substack{A \in P(N) \\ |A| = 1}} H[X_A] - H[X_{0:N}] . \tag{12}$$

The total correlation is sometimes referred to as the "multi-information", though we refrain from using this ambiguous term. It differs from the prior measure in many fundamental ways. To begin with, it is nonnegative. It also differs in that if $X_0$ is independent of the others, then $T[X_{0:N}] = T[X_{1:N}]$. Finally, it captures only the difference between individual variables and the entire set. The role of two-way and higher interactions is ignored as it leaves out the relative entropies between the entire set and more-than-two-variable marginals. Indeed, this is a common problem. The total correlation and the next measure miss or, at best, conflate $(n > 2)$-way interactions.

The last extension stems from the view that mutual information is the joint entropy minus all (single-variable) unshared information; that is, we start from Eq. (5). When interpreted this way, the generalization is called the binding information [5]:

$$B[X_{0:N}] = H[X_{0:N}] - \sum_{\substack{A \in P(N) \\ |A| = 1}} H[X_A | X_{\bar{A}}] . \tag{13}$$

Like the total correlation, the binding information is nonnegative and independent random variables do not change its value. Note that $B[X_{0:N}]$ is a first approximation to the multivariate information of Eq. (10) when the sets $A$ are restricted to singleton sets.

We next define three additional multivariate information measures that have not been studied previously, but appear following a similar strategy. First, we have the amount of information in individual variables that is not shared in any way. This is the residual entropy:

$$R[X_{0:N}] = H[X_{0:N}] - B[X_{0:N}]$$
$$\phantom{R[X_{0:N}]} = \sum_{\substack{A \in P(N) \\ |A| = 1}} H[X_A | X_{\bar{A}}] . \tag{14}$$

In a sense, it is an anti-mutual information: It measures the total amount of randomness localized to an individual variable and so not correlated to that in its peers.

Second, we can sum the total correlation and the binding information. Then we have the local exogenous information:

$$W[X_{0:N}] = B[X_{0:N}] + T[X_{0:N}] \tag{15}$$
$$\phantom{W[X_{0:N}]} = \sum_{\substack{A \in P(N) \\ |A| = 1}} \left( H[X_A] - H[X_A | X_{\bar{A}}] \right) \tag{16}$$
$$\phantom{W[X_{0:N}]} = \sum_{\substack{A \in P(N) \\ |A| = 1}} I[X_A ; X_{\bar{A}}] . \tag{17}$$

It is the amount of information in each variable that comes from its peers. It is a "very mutual" information, one that discounts for the randomness produced locally: that randomness inherent in each variable individually.



FIG. 3. A process's time series: Time indices less than zero refer to the past $X_{:0}$; index 0 to the present $X_0$; and times after 0 to the future $X_{1:}$.

$W[X_{0:N}]$ is close to the binding information, except that it uses the sum of marginals, not the joint entropy. As such, it seems to more consistently capture the role of single variables within a set than $B[X_{0:N}]$, which compares the set's joint entropy to individual residual uncertainties.

Third and finally, there is a measure which, for lack of a better name, we call the enigmatic information:

$$Q[X_{0:N}] = T[X_{0:N}] - B[X_{0:N}] . \tag{18}$$

Like the multivariate mutual information, which it equals when $N = 3$, it can be negative. Its operational meaning will become clear on further discussion.
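The measures of Eqs. (12) to (18) all reduce to joint and marginal entropies, so they can be evaluated together. The following sketch assumes the same dictionary representation of a joint distribution as one might use for Eq. (7); the function names and the two three-variable examples are illustrative, not part of the text.

```python
from itertools import product
from math import log2

def entropy(joint, indices):
    """Entropy (bits) of the marginal of `joint` over the given positions."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def measures(joint, n):
    """Total correlation, binding, residual, local exogenous, enigmatic informations."""
    everyone = list(range(n))
    H = entropy(joint, everyone)
    singles = sum(entropy(joint, [i]) for i in everyone)
    # H[X_i | all others] = H[all] - H[all others], by Eq. (3).
    R = sum(H - entropy(joint, [j for j in everyone if j != i]) for i in everyone)
    T = singles - H          # Eq. (12)
    B = H - R                # Eq. (13)
    return {"H": H, "T": T, "B": B, "R": R, "W": B + T, "Q": T - B}

# Illustrative three-variable distributions (assumptions of this sketch).
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
copies = {(b, b, b): 0.5 for b in (0, 1)}

xor_values = measures(xor, 3)
copy_values = measures(copies, 3)
print(xor_values)
print(copy_values)
```

For these two $N = 3$ cases, $Q$ comes out as $-1$ bit and $+1$ bit respectively, matching the co-information of the XOR triple and of three copies of one fair bit.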



V. TIME SERIES 

We now adapt the general multivariate measures to analyze discrete-valued, discrete-time series generated by a stationary process. That is, rather than analyzing sets of random variables, we specialize to a one-dimensional chain of them. In this setting, the measures are most appropriately applied to successively longer blocks of consecutive observations. This allows us to study the asymptotic block-length behavior of each, mimicking the approach of Refs. [2, 6]. For the class of processes known as finitary (defined shortly), each of these measures tends to a linear asymptote characterized by a subextensive component and an extensive component controlled by an asymptotic growth rate.

Let's first state more precisely and introduce the notation for the class of processes that are the object of study. We consider a bi-infinite chain $\ldots X_{-1} X_0 X_1 \ldots$ of random variables. Each $X_t$, $t \in \mathbb{Z}$, takes on a finite set of values $x_t \in \mathcal{A}$. We denote contiguous subsets of the time series with $X_{A:B}$, where the left index is inclusive and the right is exclusive. By leaving one of the indices off, the subset is partially infinite in that direction. We divide this bi-infinite chain into three segments. First we single out the present $X_0$. All the symbols prior to the present are the past $X_{:0}$. The symbols following the present are the future $X_{1:}$. Figure 3 illustrates the setting.

Our focus is on the $\ell$-blocks $X_{t:t+\ell} = X_t X_{t+1} \cdots X_{t+\ell-1}$. The associated process is specified by the set of length-$\ell$ word distributions: $\{\Pr(X_{t:t+\ell}) : t \in \mathbb{Z}, \ell \in \mathbb{N}\}$. We consider only stationary processes for which $\Pr(X_{t:t+\ell}) = \Pr(X_{0:\ell})$. And so, we drop the absolute-time index $t$. More precisely, the word probabilities derive from an underlying time-shift invariant, ergodic measure $\mu$ on the space of bi-infinite sequences.

In the following, an information measure $\mathcal{F}$ applied to the process's length-$\ell$ words is denoted $\mathcal{F}[X_{0:\ell}]$ or, as a shorthand, $\mathcal{F}(\ell)$.



[Plot for FIG. 4; horizontal axis: Block length $\ell$ [symbols].]



A. Block Entropy versus Total Correlation 

We begin with the long-studied block entropy information measure $H(\ell)$ [7, 8]. (For a review and background to the following see Ref. [6].) The block entropy curve defines two primary features. First, its growth rate limits to the entropy rate $h_\mu$. Second, its subextensive component is the excess entropy $\mathbf{E}$:

$$\mathbf{E} = I[X_{:0}; X_{0:}] , \tag{19}$$

which expresses the totality of information shared between the past and future.

The entropy rate and excess entropy, and the way in which they are approached with increasing block length, are commonly used quantifiers for complexity in many fields. They are complementary in the sense that, for finitary processes, the block entropy for sufficiently long blocks takes the form:

$$H(\ell) \sim \mathbf{E} + \ell h_\mu . \tag{20}$$

Recall that $H(0) = 0$ and that $H(\ell)$ is monotone increasing and concave down. The finitary processes, mentioned above, are those with finite $\mathbf{E}$.

Next, we turn to a less well studied measure for time series: the block total correlation $T(\ell)$. Adapting Eq. (12) to a stationary process gives its definition:

$$T(\ell) = \ell H[X_0] - H(\ell) . \tag{21}$$

Note that $T(0) = 0$ and $T(1) = 0$. Effectively, it compares a process's block entropy to the case of independent, identically distributed random variables. In many ways, the block total correlation is the reverse side of an information-theoretic coin for which the block entropy is the obverse. For finitary processes, its growth rate limits to a constant $\rho_\mu$ and its subextensive part is a constant that turns out to be $-\mathbf{E}$:

$$T(\ell) \sim -\mathbf{E} + \ell \rho_\mu . \tag{22}$$

That is, $\rho_\mu = \lim_{\ell \to \infty} T(\ell)/\ell$. Finally, $T(\ell)$ is monotone increasing, but concave up. All of this is derived directly from Eqs. (20) and (21), by using well known properties of the block entropy.

FIG. 4. Block entropy $H(\ell)$ and block total correlation $T(\ell)$ illustrating their behaviors for the NRPS Process.

The block entropy and block total correlation are plotted in Fig. 4. Both measures are 0 at $\ell = 0$ and from there approach their asymptotic behavior, denoted by the dashed lines. Though their asymptotic slopes appear to be the same, they in fact differ. Numerical data for the asymptotic values can be found in Tables I and II under the heading NRPS (defined later).

There is a persistent confusion in the neuroscience, 
complex systems, and information theory literatures con- 
cerning the relationship between block entropy and block 
total correlation. This can be alleviated by explicitly 
demonstrating a partial symmetry between the two in 
the time series setting and by highlighting a weakness of 
the total correlation. 

We begin by showing how, for stationary processes, the block entropy and the block total correlation contain much the same information. From Eqs. (7) and (12) we immediately see that:

$$H(\ell) + T(\ell) = \ell H(1) . \tag{23}$$

Furthermore, by substituting Eqs. (20) and (22) in Eq. (23) we note that the righthand side has no subextensive component. This gives further proof that the subextensive components of Eqs. (20) and (22) must be equal and opposite, as claimed. Moreover, by equating individual $\ell$-terms we find:

$$h_\mu + \rho_\mu = H(1) . \tag{24}$$

And, this is the decomposition given in Fig. 2(b): the lefthand side provides two pieces comprising the single-observation entropy $H(1)$.

Continuing, either information measure can be used to obtain the excess entropy. In addition, since the block entropy provides $h_\mu$ as well as intrinsically containing $H(1)$, $\rho_\mu$ can be directly obtained from the block entropy function by taking $H(1) - h_\mu$. The same is not true, however, for the total correlation. Though $\rho_\mu$ can be computed, one cannot obtain $h_\mu$ from $T(\ell)$ alone: $H(1)$ is required, but not available from $T(\ell)$, since it is subtracted out.
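Equations (21), (23), and (24) can be checked on a simple process with exactly computable word probabilities. The sketch below assumes a first-order Markov chain over binary symbols in which a 1 is never followed by another 1 (often called the Golden Mean process); this choice of process, the constant names, and the use of a slope at block length 12 as a stand-in for the asymptotic rate are all assumptions of the example, and it is not the NRPS process of Fig. 4.

```python
from itertools import product
from math import log2

# Assumed example process: after a 1 the next symbol is 0; after a 0 it is fair.
STATIONARY = {0: 2 / 3, 1: 1 / 3}
TRANSITION = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def word_probability(word):
    """Stationary probability of a word under the Markov chain."""
    p = STATIONARY[word[0]]
    for a, b in zip(word, word[1:]):
        p *= TRANSITION[(a, b)]
    return p

def block_entropy(length):
    """H(length) in bits, summing over all binary words of that length."""
    if length == 0:
        return 0.0
    probs = (word_probability(w) for w in product((0, 1), repeat=length))
    return -sum(p * log2(p) for p in probs if p > 0)

def block_total_correlation(length):
    """Eq. (21): T(length) = length * H(1) - H(length)."""
    return length * block_entropy(1) - block_entropy(length)

H1 = block_entropy(1)
h_mu = block_entropy(12) - block_entropy(11)                        # slope of H
rho_mu = block_total_correlation(12) - block_total_correlation(11)  # slope of T

print(f"H(1) = {H1:.4f}, h_mu = {h_mu:.4f}, rho_mu = {rho_mu:.4f}")
```

Because $T(\ell)$ is built from $H(\ell)$ and $H(1)$, the code needs the block entropy to produce $T$; going the other way would require supplying $H(1)$ separately, as noted above.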

There are further parallels between the two quantities that can be drawn. First, following Ref. [8], we define discrete derivatives of the block measures at length $\ell$:

$$h_\ell = H(\ell) - H(\ell - 1) \tag{25}$$
$$\rho_\ell = T(\ell) - T(\ell - 1) . \tag{26}$$

These approach $h_\mu$ and $\rho_\mu$, respectively. From them we can determine the subextensive components by discrete integration, while subtracting out the asymptotic behavior. We find that:

$$\mathbf{E} = \sum_{\ell=1}^{\infty} (h_\ell - h_\mu) \tag{27}$$

and also that

$$\mathbf{E} = -\sum_{\ell=1}^{\infty} (\rho_\ell - \rho_\mu) . \tag{28}$$

Second, these sums are equal term by term.

The first sum, however, indirectly brings us back to Eq. (24). Since $h_1 = H(1)$, we have:

$$\mathbf{E} = \rho_\mu + \sum_{\ell=2}^{\infty} (h_\ell - h_\mu) . \tag{29}$$

Finally, it has been said that the total correlation ("multi-information") is the first term in $\mathbf{E}$ [10]. This has perhaps given the impression that the total correlation is only useful as a crude approximation. Equation (29) shows that it is actually the total correlation rate $\rho_\mu$ that is $\mathbf{E}$'s first term. As we just showed, the total correlation is more useful than being a first term in an expansion. Its utility is ultimately limited, though, since its properties are redundant with those of the block entropy which, in addition, gives the process's entropy rate $h_\mu$.



B. A Finer Decomposition

We now show how, in the time series setting, the binding information, local exogenous information, enigmatic information, and residual entropy constitute a refinement of the single-measurement decomposition provided by the block entropy and the total correlation [5, 11]. To begin, their block equivalents are, respectively:

$$B(\ell) = H(\ell) - R(\ell) \tag{30}$$
$$Q(\ell) = T(\ell) - B(\ell) \tag{31}$$
$$W(\ell) = B(\ell) + T(\ell) , \tag{32}$$

where $R(\ell)$ does not have an analogously simple form. Their asymptotic behaviors are, respectively:

$$R(\ell) \sim \mathbf{E}_R + \ell r_\mu \tag{33}$$
$$B(\ell) \sim \mathbf{E}_B + \ell b_\mu \tag{34}$$
$$Q(\ell) \sim \mathbf{E}_Q + \ell q_\mu \tag{35}$$
$$W(\ell) \sim \mathbf{E}_W + \ell w_\mu . \tag{36}$$

Their associated rates break the prior two components ($h_\mu$ and $\rho_\mu$) into finer pieces. Substituting their definitions into Eqs. (20) and (21) we have:

$$H(\ell) = B(\ell) + R(\ell) \tag{37}$$
$$\phantom{H(\ell)} = (\mathbf{E}_B + \mathbf{E}_R) + \ell (b_\mu + r_\mu) \tag{38}$$
$$T(\ell) = B(\ell) + Q(\ell) \tag{39}$$
$$\phantom{T(\ell)} = (\mathbf{E}_B + \mathbf{E}_Q) + \ell (b_\mu + q_\mu) . \tag{40}$$

The rates in Eqs. (38) and (40) corresponding to $h_\mu$ and $\rho_\mu$, respectively, give the decomposition laid out in Fig. 2(c) above. Two of these components ($b_\mu$ and $r_\mu$) were defined in Ref. [5] and the third ($q_\mu$) is a direct extension. We defer interpreting them to Sec. VI B, which provides greater understanding by appealing to the semantics afforded by the process information diagram developed there.

The local exogenous information, rather than refining the decomposition provided by the block entropy and the total correlation, provides a different decomposition:

$$W(\ell) = B(\ell) + T(\ell) \tag{41}$$
$$\phantom{W(\ell)} = (\mathbf{E}_B - \mathbf{E}) + \ell (b_\mu + \rho_\mu) . \tag{42}$$

So, $w_\mu = b_\mu + \rho_\mu$, as mentioned in Fig. 2.

Similar to Eq. (23), we can take the local exogenous information together with the residual entropy and find:

$$R(\ell) + W(\ell) = \ell H(1) . \tag{43}$$

This implies that $\mathbf{E}_R = -\mathbf{E}_W$ and that $r_\mu$ and $w_\mu$ are yet another partitioning of $H[X]$, as shown earlier in Fig. 2(d).
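Although $R(\ell)$ has no simple closed form, it can be computed by brute force for short blocks, after which Eqs. (30) to (32) give the other block measures. The sketch below assumes the same kind of simple binary Markov chain used for illustration elsewhere (a 1 is never followed by a 1); the process, the helper names, and the use of a slope at $\ell = 8$ as a proxy for the asymptotic rates are assumptions of this example.

```python
from itertools import product
from math import log2

# Assumed example process: after a 1 the next symbol is 0; after a 0 it is fair.
STATIONARY = {0: 2 / 3, 1: 1 / 3}
TRANSITION = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def word_distribution(length):
    """Probabilities of all allowed words of the given length (length >= 1)."""
    dist = {}
    for word in product((0, 1), repeat=length):
        p = STATIONARY[word[0]]
        for a, b in zip(word, word[1:]):
            p *= TRANSITION[(a, b)]
        if p > 0:
            dist[word] = p
    return dist

def entropy(dist, indices):
    """Entropy (bits) of the marginal over the given block positions."""
    marginal = {}
    for word, p in dist.items():
        key = tuple(word[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values())

def block_measures(length):
    dist = word_distribution(length)
    positions = list(range(length))
    H = entropy(dist, positions)
    H1 = entropy(dist, [0])
    # R sums H[X_i | all other symbols in the block], as in Eq. (14).
    R = sum(H - entropy(dist, [j for j in positions if j != i]) for i in positions)
    T = length * H1 - H      # Eq. (21)
    B = H - R                # Eq. (30)
    return {"H": H, "H1": H1, "T": T, "R": R, "B": B, "Q": T - B, "W": B + T}

m7, m8 = block_measures(7), block_measures(8)
r_mu = m8["R"] - m7["R"]
b_mu = m8["B"] - m7["B"]
q_mu = m8["Q"] - m7["Q"]
w_mu = m8["W"] - m7["W"]
print(f"r = {r_mu:.4f}, b = {b_mu:.4f}, q = {q_mu:.4f}, w = {w_mu:.4f}")
```

The printed rates satisfy $b_\mu + r_\mu = h_\mu$, $b_\mu + q_\mu = \rho_\mu$, and $r_\mu + w_\mu = H(1)$ for this example, which is the bookkeeping expressed by Eqs. (38), (40), and (43).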

Figure 5 illustrates these four block measures for a generic process. Each of the four measures reaches asymptotic linear behavior at a length of $\ell = 9$ symbols. Once there, we see that they each possess a slope that we just showed to be a decomposition of the slopes from the measures in Fig. 4. Furthermore, each has a subextensive component that is found as the y-intercept of the linear asymptote. These subextensive parts provide a decomposition of the excess entropy, discussed further below in Sec. VI B 3.

FIG. 5. Block equivalents of the residual entropy $R(\ell)$, binding information $B(\ell)$, enigmatic information $Q(\ell)$, and local exogenous information $W(\ell)$ for a generic process (same as previous figure).

C. Multivariate Mutual Information 

Lastly, we come to the block equivalent of the multivariate mutual information $I[X_{0:N}]$:

$$I(\ell) = H(\ell) - \sum_{\substack{A \in P(\ell) \\ 0 < |A| < \ell}} I[X_A | X_{\bar{A}}] . \tag{44}$$

Superficially, it scales similarly to the other measures:

$$I(\ell) \sim \mathbf{I} + \ell\, i_\mu , \tag{45}$$

with an asymptotic growth rate $i_\mu$ and a constant subextensive component $\mathbf{I}$. Yet, it has differing implications regarding what it captures in the process. This is drawn out by the following propositions, whose proofs appear elsewhere.

The first concerns the subextensive part of $I(\ell)$.

Proposition 1. For all finite-state processes:

$$h_\mu > 0 \;\Rightarrow\; \lim_{\ell \to \infty} I(\ell) = 0 . \tag{46}$$

The intuition behind this is fairly straightforward. For $I(\ell)$ to be nonzero, no two observations can be independent. Finite-state processes with positive $h_\mu$ are stochastic, however. So, observations become (conditionally) decoupled exponentially fast. Thus, for arbitrarily long blocks, the first and the last observations tend toward independence exponentially and so $I(\ell)$ limits to 0.

The second proposition regards the growth rate $i_\mu$.

Proposition 2. For all finite-state processes:

$$i_\mu = 0 . \tag{47}$$

The intuition behind this follows from the first proposition. If $h_\mu > 0$, then since $I(\ell)$ tends toward 0, the slope must also tend toward 0. What remains are those processes that are finite state but for which $h_\mu = 0$. These are the periodic processes. For them, $i_\mu$ also vanishes since, although $I(\ell)$ may be nonzero, there is a finite amount of information contained in a bi-infinite periodic sequence. Once all this information has been accounted for at a particular block length, then for all blocks larger than this there is no additional information to gain. And so, $i_\mu$ decays to 0.

The final result concerns the subextensive component $\mathbf{I}$.

Proposition 3. For all finite-state processes with $h_\mu > 0$:

$$\mathbf{I} = 0 . \tag{48}$$

This follows directly from the previous two propositions.

Thus, the block multivariate mutual information is qualitatively different from the other block measures. It appears to be most interesting for infinitary processes with infinite excess entropy.

Figure 6 demonstrates the general behavior of $I(\ell)$, illustrating the three propositions. The dashed line highlights the asymptotic behavior of $I(\ell)$: both $\mathbf{I}$ and $i_\mu$ vanish. We further see that $I(\ell)$ is not restricted to positive values. It oscillates about 0 until length $\ell = 11$ where it finally vanishes.
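For short blocks, $I(\ell)$ can be evaluated by applying the alternating sum of Eq. (9) to the word distribution. The sketch below assumes a simple binary Markov chain (a 1 is never followed by a 1) purely as an illustrative stochastic finite-state process; it is not the process plotted in Fig. 6, and the helper names are inventions of this example. The cost grows as $2^\ell$ subsets per block, so it is only practical for small $\ell$.

```python
from itertools import combinations, product
from math import log2

# Assumed example process: after a 1 the next symbol is 0; after a 0 it is fair.
STATIONARY = {0: 2 / 3, 1: 1 / 3}
TRANSITION = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def word_distribution(length):
    """Probabilities of all allowed words of the given length (length >= 1)."""
    dist = {}
    for word in product((0, 1), repeat=length):
        p = STATIONARY[word[0]]
        for a, b in zip(word, word[1:]):
            p *= TRANSITION[(a, b)]
        if p > 0:
            dist[word] = p
    return dist

def entropy(dist, indices):
    """Entropy (bits) of the marginal over the given block positions."""
    marginal = {}
    for word, p in dist.items():
        key = tuple(word[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values())

def block_co_information(length):
    """Eq. (9) on a length-`length` block; the empty subset contributes zero."""
    dist = word_distribution(length)
    total = 0.0
    for size in range(1, length + 1):
        for subset in combinations(range(length), size):
            total -= (-1) ** size * entropy(dist, subset)
    return total

values = {L: block_co_information(L) for L in range(1, 9)}
for L, value in values.items():
    print(L, round(value, 6))
```

At $\ell = 1$ the sum reduces to $H(1)$ and at $\ell = 2$ to the two-symbol mutual information; the values at larger $\ell$ are what Propositions 1 to 3 speak to.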

VI. INFORMATION DIAGRAMS 

Information diagrams [12] provide a graphical and intuitive way to interpret the information-theoretic relationships among variables. In construction and concept, they are very similar to Venn diagrams. The key difference is that the measure used is a Shannon entropy rather than a set size. Additionally, an overlap is not a set intersection but rather a mutual information. The irreducible intersections are, in fact, elementary atoms of a sigma-algebra over the random-variable event space. An atom's size reflects the magnitude of one or another Shannon information measure: marginal, joint, or conditional entropy or mutual information.

FIG. 6. Block multivariate mutual information $I(\ell)$ for the same example process as before.



A. Four-Variable Information Diagrams

Using information diagrams we can deepen our understanding of the multivariate informations defined in Sec. IV. Fig. 7 illustrates them for four random variables: X, Y, Z, W. There, an atom's shade of gray denotes how much weight it carries in the overall value of its measure. Consider, for example, the total correlation I-diagram in Fig. 7(c). From the definition of the total correlation, Eq. (12), we see that each variable provides one count to each of its atoms and then a count is removed from each atom. Thus, the atom I[W; X; Y; Z] associated with the four-way intersection W ∩ X ∩ Y ∩ Z, which is contained in each of the four variables, carries a total weight of 4 - 1 = 3. Those atoms contained in three variables carry a weight of 2, those shared among only two variables a weight of 1, and information solely contained in one variable is not counted at all.

Utilizing the I-diagrams in Fig. 7, we can easily visualize and intuit how these various information measures relate to each other and the distributions they represent. In Fig. 7(a), we find the joint entropy. Since it represents all information contained in the distribution with no bias to any sort of interaction, we see that it counts each and every atom once. The residual entropy, Fig. 7(e), is equally easy to interpret: it counts each atom which is not shared by two or more variables.

The distinctions in the menagerie of measures attempting to capture interactions among N variables can also be easily seen. The multivariate mutual information, Fig. 7(b), stands out in that it is isolated to a single atom, that contained in all variables. This makes it clear why the independence of any two of the variables leads to a zero value for this measure. The total correlation, Fig. 7(c), contains all atoms contained in at least two variables and gives higher weight to those contained in more variables. The local exogenous information, Fig. 7(f), is similar. It counts the same atoms as the total correlation does, but it gives them higher weight. Lastly, the binding information, Fig. 7(d), also counts the same atoms, but only weights each of them once regardless of how many variables they participate in.

The enigmatic information, Fig. 7(g), counts only those atoms that participate in at least three variables and, similar to the total correlation, it counts those that participate in more variables more heavily.

FIG. 7. Four-variable information diagrams for the multivariate information measures of Sec. IV: (a) joint entropy; (b) multivariate mutual information, Eq. (9); (c) total correlation, Eq. (12); (d) binding information; (e) residual entropy, Eq. (14); (f) local exogenous information, Eq. (17); (g) enigmatic information, Eq. (18). Darker shades of gray denote heavier weighting in the corresponding informational sum. For example, the atoms to which all four variables contribute are added thrice to the total correlation and so the central atom's weight I[W; X; Y; Z] = 3.
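
The atom weights described above can be tabulated mechanically. The sketch below is illustrative only: it labels each atom of the four-variable I-diagram by the set of variables that contain it and assigns weights for the five measures whose weighting the text states explicitly. The local exogenous and enigmatic informations are left out because the text describes their weights only qualitatively. The function names and measure keys are hypothetical.

```python
from itertools import combinations

VARIABLES = ("W", "X", "Y", "Z")


def atoms(variables):
    """Label each atom by the nonempty set of variables that contain it."""
    return [
        frozenset(combo)
        for k in range(1, len(variables) + 1)
        for combo in combinations(variables, k)
    ]


def weight(measure, atom, n_vars):
    """Number of times an atom is counted in the given measure."""
    k = len(atom)  # how many variables share this atom
    if measure == "joint_entropy":
        return 1  # every atom counted once
    if measure == "multivariate_mutual_information":
        return 1 if k == n_vars else 0  # only the central atom
    if measure == "total_correlation":
        return k - 1  # one count per variable, minus one
    if measure == "binding_information":
        return 1 if k >= 2 else 0  # shared atoms, each counted once
    if measure == "residual_entropy":
        return 1 if k == 1 else 0  # atoms not shared with any other variable
    raise ValueError(f"unknown measure: {measure}")


if __name__ == "__main__":
    for atom in atoms(VARIABLES):
        label = "".join(sorted(atom))
        print(label, weight("total_correlation", atom, len(VARIABLES)))
```

In this bookkeeping every atom's joint-entropy weight equals the sum of its binding-information and residual-entropy weights, since an atom is either shared or it is not.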



B. Process Information Diagrams 

Following Ref. [13], we adapt the multivariate I-diagrams just laid out to tracking information in finitary stationary processes. In particular, we develop process I-diagrams to explain the information in a single observation, as described before in Fig. 2. The resulting process I-diagram is displayed in Fig. 8. As we will see, exploring the diagram gives a greater, semantic understanding of the relationships among the process variables and, as we will emphasize, of the internal structure of the process itself.

For all measures, except the multivariate mutual information, the extensive rate corresponds to one or more atoms in the decomposition of $H[X_0]$. To begin, we allow $H[X_0]$ to be split in two by the past. This exposes two pieces: $h_\mu$, the part exterior to the past, and $\rho_\mu$, the part interior. This partitioning has been well studied in information theory due to how it naturally arises as one observes a sequence. This decomposition is displayed in Fig. 9(a).



Taking a step back and including the future in the diagram, we obtain a more detailed understanding of how information is transmitted in a process. The past and the future together divide $H[X_0]$ into four parts; see Fig. 9(b). We will discuss each part shortly. First, however, we draw out a different decomposition, that into $r_\mu$ and $w_\mu$, as seen in Fig. 9(c). From this diagram it is easy to see the semantic meaning behind the decomposition: $r_\mu$ is divorced from any temporal structure, while $w_\mu$ is steeped in it.

We finally turn to the partitioning shown in Fig. 9(b). The process I-diagram makes it rather transparent in which sense $r_\mu$ is an amount of ephemeral information: its atom lies outside both the past and future sets and so it exists only in the present moment, having no repercussions for the future and being no consequence of the past. It is the amount of information in the present observation neither communicated to the future nor from the past. Ref. [5] referred to this as the residual entropy rate, as it is the amount of uncertainty that remains in the present even after accounting for every other variable in the time series.

Ref. [5] also proposed to use $b_\mu$ as a measure of structural complexity, and we tend to agree. The argument for this is intuitive: $b_\mu$ is an amount of information that is present now, is not explained by the past, but has repercussions in the future. That is, it is the portion of the entropy rate that has consequences. In some contexts one may prefer to employ the ratio $b_\mu / h_\mu$ when $b_\mu$ is interpreted as an indicator of complex behavior since, for a fixed $b_\mu$, larger $h_\mu$ values imply less temporal structure in the time series.

FIG. 8. I-diagram anatomy of $H[X_0]$ in the full context of time: The past $X_{:0}$ partitions $H[X_0]$ into two pieces: $h_\mu$ and $\rho_\mu$. The future $X_{1:}$ then partitions those further into $r_\mu$, two $b_\mu$s, and $q_\mu$. This leaves a component $\sigma_\mu$, shared by the past and the future, that is not in the present $X_0$.

Due to stationarity, the mutual information $I[X_0; X_{1:} | X_{:0}]$ between the present $X_0$ and the future $X_{1:}$ conditioned on the past $X_{:0}$ is the same as the mutual information $I[X_0; X_{:0} | X_{1:}]$ between $X_0$ and the past $X_{:0}$ conditioned on the future $X_{1:}$. Moreover, both are $b_\mu$. This lends a symmetry to the process I-diagram that does not exist for nonstationary processes. Thus, the two $b_\mu$ atoms in Fig. 8 are the same size.

There are two atoms remaining in the process I-diagram that have not been discussed in the literature. Both merit attention. The first is $q_\mu$, the information shared by the past, the present, and the future. Notably, its value can be negative and we discuss this further below in Sec. VI B 1. The other piece, denoted $\sigma_\mu$, is a component of information shared between the past and the future that does not exist in the present observation. This piece is vital evidence that attempting to understand a process without using a model for its generating mechanism is ultimately incomplete. We discuss this point further in Sec. VI B 2 below.



1. Negativity of $q_\mu$

The sign of $q_\mu$ holds valuable information. To see what this is, we apply the partial information decomposition [13] to further analyze $w_\mu = I[X_0; X_{:0}, X_{1:}]$, that portion of the present shared with the past and future. By decomposing $w_\mu$ into four pieces, three of which are unique, we gain greater insight into the value of $q_\mu$ and also draw out potential asymmetries between the past and the future.

The partial information lattice provides us with a method to isolate (i) the contributions $\Pi_{\{X_{:0}\}\{X_{1:}\}}$ to $w_\mu$ that both the past and the future provide redundantly, (ii) parts $\Pi_{\{X_{:0}\}}$ and $\Pi_{\{X_{1:}\}}$ that are uniquely provided by the past and the future, respectively, and (iii) a part $\Pi_{\{X_{:0}, X_{1:}\}}$ that is synergistically provided by both the past and the future. Note that, due to stationarity, $\Pi_{\{X_{:0}\}} = \Pi_{\{X_{1:}\}}$. We refer to this as the uniquity and denote it $\iota$.

FIG. 9. The three decompositions of $H[X_0]$ from Fig. 5. The dissecting lines are identical to those in Fig. 8.

Using Ref. [14], we see that $q_\mu$ is equal to the redundancy minus the synergy of the past and the future, when determining the present. Thus, if $q_\mu > 0$, the past and future predominantly contribute redundant information to the present. When $q_\mu < 0$, however, considering the past and the future separately in determining the present misses essential correlations. The latter can be teased out if the past and future are considered together.
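
A toy calculation makes the sign convention concrete. The sketch below is not drawn from the processes studied here: it uses three binary variables standing in for past, future, and present, and computes the three-variable multivariate mutual information from subset entropies. An exclusive-or relation, which is purely synergistic, gives a negative value, while three copies of one fair bit, which are purely redundant, give a positive value. The names `co_information`, `xor`, and `copy` are illustrative.

```python
from itertools import product
from math import log2


def entropy(joint, indices):
    """Entropy in bits of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def co_information(joint):
    """Three-variable multivariate mutual information I[A; B; C]."""

    def h(*idx):
        return entropy(joint, idx)

    return h(0) + h(1) + h(2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0, 1, 2)


# Synergy: the third bit is the XOR of two independent fair bits.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

# Redundancy: all three variables are copies of one fair bit.
copy = {(a, a, a): 0.5 for a in (0, 1)}

if __name__ == "__main__":
    print(co_information(xor))   # -1.0
    print(co_information(copy))  # 1.0
```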

The process I-diagram (Fig. 8) showed that the mutual information between the present and either the past or the future is $\rho_\mu$. One might suspect from this that the past and the future provide the same information to the present, but this would be incorrect. Though they provide the same quantity of information to the present, what that information conveys can differ. This is evidence of a process's structural irreversibility; cf. Refs. HHHH]. In this light, the redundancy $\Pi_{\{X_{:0}\}\{X_{1:}\}}$ between the past and future when considering the present is $\rho_\mu - \iota$. Furthermore, the synergy $\Pi_{\{X_{:0}, X_{1:}\}}$ provided by the past and the future is equal to $b_\mu - \iota$.

Taking this all together, we find what we already knew: that $q_\mu = \rho_\mu - b_\mu$. The journey to this conclusion, however, provided us with deeper insight into what negative $q_\mu$ means and into the structure of $w_\mu$ and the process as a whole.

FIG. 10. Partial information decomposition of $w_\mu = I[X_0; X_{:0}, X_{1:}]$. The multivariate mutual information $q_\mu$ is given by the redundancy $\Pi_{\{X_{:0}\}\{X_{1:}\}}$ minus the synergy $\Pi_{\{X_{:0}, X_{1:}\}}$. $w_\mu = \rho_\mu + b_\mu$ is the sum of all atoms in this diagram.

2. Consequence of $\sigma_\mu$: Why we model

Notably, the final piece of the process I-diagram is not part of $H[X_0]$; it is not a component of the information in a single observation. This is $\sigma_\mu$, which represents information that is transmitted from the past to the future, but does not go through the currently observed symbol $X_0$. This is readily understood and leads to an important conclusion.

If one believes that the process under study is generated according to the laws of physics, then the process's internal physical configuration must store all the information from the past that is relevant for generating future behavior. Only when the observed process is order-1 Markov is it sufficient to keep track of just the current observable. For the plethora of processes that are not order-1 or that are non-Markovian altogether, we are faced with the fact that information relevant for future behavior must be stored somehow. And, this fact is reflected in the existence of $\sigma_\mu$. When $\sigma_\mu > 0$, a complete description of the process requires accounting for this internal configurational or, simply, state information. This is why we build models and cannot rely on only collecting observation sequences.

The amount of information shared between $X_{:0}$ and $X_{1:}$, but ignoring $X_0$, was previously discussed in Ref. [10]. We now see that the meaning of this information quantity, there denoted $I_1$, is easily gleaned from its components: $I_1 = q_\mu + \sigma_\mu$.

Furthermore, in Refs. [5], [11], and [18], efficient computation of $b_\mu$ and $I_1$ was not provided, and the brute-force estimates are inaccurate and very compute intensive. Fortunately, by a direct extension of the methods developed in Ref. [15] on bidirectional machines, we can easily compute both $r_\mu = H[X_0 | \mathcal{S}^+_0, \mathcal{S}^-_1]$ and $I_1 = I[\mathcal{S}^+_0; \mathcal{S}^-_1]$. This is done by constructing joint probabilities of forward-time and reverse-time causal states, $\{\mathcal{S}^+\}$ and $\{\mathcal{S}^-\}$, respectively, at different time indices employing the dynamic of the bidirectional machine. This gives closed-form, exact methods of calculating these two measures, provided one constructs the process's forward and reverse ε-machines. $b_\mu$ follows directly in this case since it is the difference of $h_\mu$ and $r_\mu$; the former is also directly calculated from the ε-machine.



3. Decompositions of E 

Using the process I-diagram and the tools provided above, three unique decompositions of the excess entropy, Eq. (19), can be given. Each provides a different interpretation of how information is transmitted from the past to the future.

The first is provided by Eqs. (37)-(40). The subextensive parts of the block entropy and total correlation there determine the excess entropy decomposition. We have:

    $E = E_B + E_R$
    $\;\;\, = -E_B - E_Q$
    $\;\;\, = -\tfrac{1}{2}(E_W + E_Q)$ .    (49)-(52)

We leave the meaning behind these decompositions as an open problem, but do note that they are distinct from those discussed next.

The second and third decompositions both derive directly from the process I-diagram of Fig. 8. Without further work, one can easily see that the excess entropy breaks into three pieces, all previously discussed:

    $E = b_\mu + q_\mu + \sigma_\mu$ .    (53)
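
The atom relations can be checked numerically against the values reported in Tables I and II of Sec. VII. The sketch below is only a consistency check, with the table values transcribed by hand to five decimal places. It tests Eq. (53) together with the relations stated in the text: $H[X_0] = h_\mu + \rho_\mu = r_\mu + w_\mu$, $h_\mu = r_\mu + b_\mu$, $q_\mu = \rho_\mu - b_\mu$, and $w_\mu = \rho_\mu + b_\mu$. The dictionary keys and the tolerance are illustrative choices.

```python
# Values transcribed from Tables I and II (bits, five decimal places).
TABLE = {
    "Even": dict(H1=0.91830, h=0.66667, rho=0.25163, r=0.00000, b=0.66667,
                 q=-0.41504, w=0.91830, sigma=0.66667, E=0.91830),
    "Golden Mean": dict(H1=0.91830, h=0.66667, rho=0.25163, r=0.45915, b=0.20752,
                        q=0.04411, w=0.45915, sigma=0.00000, E=0.25163),
    "NRPS": dict(H1=0.97987, h=0.50000, rho=0.47987, r=0.16667, b=0.33333,
                 q=0.14654, w=0.81320, sigma=1.09407, E=1.57393),
}

TOL = 5e-5  # allows for rounding in the tabulated values


def residuals(v):
    """Return each identity's left side minus right side for one process."""
    return {
        "E = b + q + sigma": v["E"] - (v["b"] + v["q"] + v["sigma"]),
        "H1 = h + rho": v["H1"] - (v["h"] + v["rho"]),
        "H1 = r + w": v["H1"] - (v["r"] + v["w"]),
        "h = r + b": v["h"] - (v["r"] + v["b"]),
        "q = rho - b": v["q"] - (v["rho"] - v["b"]),
        "w = rho + b": v["w"] - (v["rho"] + v["b"]),
    }


def all_consistent(table=TABLE, tol=TOL):
    """True if every identity holds to within tol for every process."""
    return all(
        abs(res) <= tol
        for values in table.values()
        for res in residuals(values).values()
    )


if __name__ == "__main__":
    for name, values in TABLE.items():
        for identity, res in residuals(values).items():
            print(f"{name:12s} {identity:20s} residual = {res:+.5f}")
```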



And, finally, one can perform the partial information decomposition on the mutual information $I[X_{:0}; X_0, X_{1:}]$. The result gives an improved understanding of (i) how much information is uniquely shared with either the immediate or the more distant future and (ii) how much is redundantly or synergistically shared with both.

The decompositions provided by the atoms of the process I-diagram and those provided by the subextensive rates of block-information curves are conceptually quite different. It has been shown [17] that the subextensive part of the block entropy and the mutual information between the past and the future, though equal for one-dimensional processes, differ in two dimensions. We believe the semantic differences shown here are evidence that the degeneracy of alternate E-decompositions breaks in higher dimensions.



VII. EXAMPLES 

We now make the preceding concrete by calculating these quantities for three different processes, selected to illustrate a variety of informational properties. Figure 11 gives each process via its ε-machine [15]: the Even Process, the Golden Mean Process, and the Noisy Random Phase-Slip (NRPS) Process. A process's ε-machine consists of its causal states, a partitioning of infinite pasts into sets that give rise to the same predictions about future behavior. The state transitions are labeled p|s, where s is the observed symbol and p is the conditional probability of observing that symbol given the state the process is in. The ε-machine representation for a process is its minimal unifilar presentation.

Table I begins by showing the single-observation entropy $H[1]$ followed by $h_\mu$ and $\rho_\mu$. Note that the Even and the Golden Mean Processes cannot be differentiated using these measures alone. The table then follows with the finer decomposition. We now see that the processes can be differentiated. We can understand fairly easily that the Even Process, being infinite-order Markovian and consisting of blocks of 1s of even length separated by one or more 0s, exhibits more structure than the Golden Mean Process. (This is rather intuitive if one recalls that the Golden Mean Process has only a single restriction: it cannot generate sequences with consecutive 0s.) We see that, for the Even Process, $r_\mu$ is 0. This can be understood by considering a bi-infinite sample from the Even Process with a single gap in it. The structure of this process is such that we can always and immediately identify what that missing symbol must be.

These two processes are further differentiated by $q_\mu$, which is negative for the Even Process and positive for the Golden Mean Process. On the one hand, this implies that there is a larger amount of synergy than redundancy in the Even Process. Indeed, it is often the case, when appealing only to the past or the future, that one cannot determine the value of $X_0$, but when taken together the possibilities are limited to a single symbol. On the other hand, since $q_\mu$ is positive for the Golden Mean Process we can determine that its behavior is dominated by redundant contributions. That $b_\mu$ is larger for the Even Process than the Golden Mean Process is consonant with the impression that the former is, overall, more structured.

The next value in the table is $\sigma_\mu$, the amount of state information not contained in the current observable. This vanishes for the Golden Mean Process, as it is order-1 Markovian. The Even Process, however, has a significant amount of information stored that is not observable in the present.

FIG. 11. ε-Machine presentations for the three example processes: (a) Even Process; (b) Golden Mean Process; (c) Noisy Random Phase-Slip Process.

Last in the table is a partial information decomposition of $I[X_0; X_{:0}, X_{1:}]$. $q_\mu$ is given by $\Pi_{\{X_{:0}\}\{X_{1:}\}} - \Pi_{\{X_{:0}, X_{1:}\}}$. Of note here is the NRPS Process's nonzero uniquity $\iota = 0.02437$. For the Even and Golden Mean Processes it vanishes. That is, in the NRPS Process information is uniquely communicated to the present from the past and an equivalent in magnitude, but different, information is communicated to the future. Thus, the NRPS Process illustrates a subtle asymmetry in statistical structure.
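
For the Golden Mean Process the entries of Table I can be reproduced with a few lines of arithmetic. The sketch below is not the bidirectional-machine method of Sec. VI B 2. It relies on the process being order-1 Markovian, so that conditioning on the whole past (future) reduces to conditioning on the previous (next) symbol, and it assumes transition probabilities of 1/2 out of the state reached after a 1. The names `triple_distribution` and `anatomy` are illustrative.

```python
from itertools import product
from math import log2

# Assumed Golden Mean Process as an order-1 Markov chain over symbols:
# after a 1, emit 0 or 1 with probability 1/2 each; after a 0, always emit 1.
TRANS = {1: {0: 0.5, 1: 0.5}, 0: {1: 1.0}}
STATIONARY = {0: 1 / 3, 1: 2 / 3}


def triple_distribution():
    """Joint distribution of (X_-1, X_0, X_1) under the stationary chain."""
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        p = STATIONARY[a] * TRANS[a].get(b, 0.0) * TRANS[b].get(c, 0.0)
        if p > 0:
            joint[(a, b, c)] = p
    return joint


def entropy(joint, indices):
    """Entropy in bits of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def anatomy():
    """Atoms of H[X_0]; valid only under the order-1 Markov assumption."""
    joint = triple_distribution()

    def h(*idx):
        return entropy(joint, idx)

    single = h(1)                      # H[X_0]
    h_mu = h(0, 1) - h(0)              # H[X_0 | X_-1]
    rho_mu = single - h_mu             # I[X_0; X_-1]
    r_mu = h(0, 1, 2) - h(0, 2)        # H[X_0 | X_-1, X_1]
    b_mu = h_mu - r_mu
    q_mu = rho_mu - b_mu
    w_mu = rho_mu + b_mu
    sigma_mu = h(0, 1) + h(1, 2) - h(1) - h(0, 1, 2)  # I[X_-1; X_1 | X_0]
    return dict(H1=single, h=h_mu, rho=rho_mu, r=r_mu, b=b_mu,
                q=q_mu, w=w_mu, sigma=sigma_mu)


if __name__ == "__main__":
    for name, value in anatomy().items():
        print(f"{name:6s} {value:.5f}")
```

The same shortcut does not apply to the Even or NRPS Processes, which are not order-1 Markovian; for those, the causal-state construction described in Sec. VI B 2 is needed.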

Table II then provides an alternate breakdown of E for each prototype process. We use this here only to highlight how much the processes differ in character from one another. The consequences of the first decomposition of excess entropy, $E = b_\mu + q_\mu + \sigma_\mu$, follow directly from the previous table's discussion. The second and third decompositions into $E_R + E_B$ and $-E_B - E_Q$ vary from one another significantly. The Even Process has much larger values for these pieces than the total E, whereas the NRPS Process has two values nearly equal to E and one very small. The Golden Mean Process falls somewhere between these two.

                                          Even        Golden Mean   NRPS
    $H[1]$                                0.91830     0.91830       0.97987
    $h_\mu$                               0.66667     0.66667       0.50000
    $\rho_\mu$                            0.25163     0.25163       0.47987
    $r_\mu$                               0.00000     0.45915       0.16667
    $b_\mu$                               0.66667     0.20752       0.33333
    $q_\mu$                              -0.41504     0.04411       0.14654
    $w_\mu$                               0.91830     0.45915       0.81320
    $\sigma_\mu$                          0.66667     0.00000       1.09407
    $\Pi_{\{X_{:0}\}\{X_{1:}\}}$          0.25163     0.25163       0.45550
    $\iota$: $\Pi_{\{X_{:0}\}}$, $\Pi_{\{X_{1:}\}}$   0.00000     0.00000       0.02437
    $\Pi_{\{X_{:0}, X_{1:}\}}$            0.66667     0.20752       0.30896

TABLE I. Information measure analysis of three processes.

                                          Even        Golden Mean   NRPS
    $E$                                   0.91830     0.25163       1.57393
    $b_\mu$                               0.66667     0.20752       0.33333
    $q_\mu$                              -0.41504     0.04411       0.14654
    $\sigma_\mu$                          0.66667     0.00000       1.09407
    $E_R$                                 4.48470     0.41504       1.55445
    $E_B$                                -3.56640    -0.16341       0.01948
    $E_Q$                                 2.64810    -0.08822      -1.59342
    $E_W$                                -4.48470    -0.41504      -1.55445
    $\Pi_{\{X_0\}\{X_{1:}\}}$             0.25163     0.04411       0.47987
    $\Pi_{\{X_0\}}$                       0.00000     0.20752       0.00000
    $\Pi_{\{X_{1:}\}}$                    0.00000     0.00000       0.76073
    $\Pi_{\{X_0, X_{1:}\}}$               0.66667     0.00000       0.33333

TABLE II. Alternative decompositions of excess entropy E for the three prototype processes.

The final excess entropy breakdown is provided by the partial information decomposition of $I[X_{:0}; X_0, X_{1:}]$. Here, we again see differing properties among the three processes. The Even Process consists only of redundancy $\Pi_{\{X_0\}\{X_{1:}\}}$ and synergy $\Pi_{\{X_0, X_{1:}\}}$. The Golden Mean Process contains no synergy, a small amount of redundancy, and most of its information sharing is with the present uniquely. The NRPS Process possesses both synergy and redundancy, but also a significant amount of information shared solely with the more distant future.

And, finally, Fig. 12 plots how $h_\mu$ partitions into $r_\mu$ and $b_\mu$ for the Golden Mean family of processes. This family consists of all processes with ε-machine structure given in Fig. 11(b), but where the outgoing transition probabilities from state A are parametrized. We can easily see that for small self-loop transition probabilities, the majority of $h_\mu$ is consumed by $b_\mu$. This should be intuitive since, when the self-loop probability is small, the process is nearly periodic and $r_\mu$ should be nearly zero. On the other end of the spectrum, when the self-loop probability is large, $h_\mu$ is mostly consumed by $r_\mu$. This is again intuitive since observations from that process are dominated by 1s and the occasional 0, which provides all the entropy for $h_\mu$, has no effect on structure.

FIG. 12. The breakdown of $h_\mu$ for the Golden Mean Process. The self-loop probability p (horizontal axis) was varied from 0 to 1, adjusting the other edge's probability accordingly.

VIII. CONCLUDING REMARKS 

We began by outlining a conceptual decomposition of 
a single observation in a time series: a single observation 
contains a hierarchy of informational components. We 
then made the decomposition concrete using a variety 
of multivariate information measures. Adapting them 
to time series, we showed that their asymptotic growth 
rates are identified with the hierarchical decomposition. 
To unify the various competing views, we provided the 
measurement-centric process I-diagram, demonstrating 
that it concisely reveals the semantic meaning behind 
each component in the hierarchy. 

Once the measurement-centric process I-diagram was available, we isolated two components, analyzing in detail their meaning. We utilized the partial information lattice [13] to refine our understanding of when the past and the future redundantly and synergistically inform the present. This allowed us to explain a subtle statistical asymmetry: the directionality in the difference between $\rho_\mu$ and $\Pi_{\{X_{:0}\}\{X_{1:}\}}$.

The other atom we singled out in the process I-diagram was $\sigma_\mu$. It is the most compelling evidence that analyzing a process from its measurements alone, without constructing a state-based model, is ultimately limited.

Next, we discussed how the different methods and mea- 
sures relate to one of the most widely used complexity 
measures — the past-future mutual information or excess 
entropy. In particular, we showed how they yield four 
distinct decompositions and, in some cases, give useful 
interpretations of what these decompositions mean oper- 
ationally. 

Then, we calculated all the measures for three different prototype processes, each highlighting particular features of the information-theoretic decompositions. We gave interpretations of negative mutual informations, as seen in $q_\mu$. The interpretations were consistent, understandable, and insightful. There was nothing untoward about negative informations.

By adapting it to the time series setting, we high- 
lighted a key weakness of the total correlation (or multi- 
information). This undoubtedly explains the lack of in- 
terest in using it in the time series setting, though the 
weakness still holds when it is used to analyze any group 
of random variables. The weakness has led to persis- 
tent over-interpretations of what it describes. It also may 
have eclipsed the importance of its more complete analog, 
such as the block entropy, in the settings of networked 
random variables. 

In closing, we take a longer view. There is an exponential number of possible atoms for N-way information measures. In addition, there is a similarly large number of possible partial information decompositions for N variables. This diversity presents the possibility of a large
number of independent efforts to define and uniquely mo- 
tivate why one or the other information measure is the 
best. Indeed, many of these yet-to-be-explored measures 
may be useful. In this light, there is a bright future for 
developing information measures adapted to a wide range 
of nonlinear, complex systems. And, helpfully, a unifying 
framework appears to be emerging. 